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Abstract 

Deciding whether a model provides a good description of data is often based on a 
goodness-of-fit criterion summarized by a p-value. Although there is considerable con- 
fusion concerning the meaning of p-values, leading to their misuse, they are nevertheless 
of practical importance in common data analysis tasks. We motivate their application using 
a Bayesian argumentation. We then describe commonly and less commonly known dis- 
crepancy variables and how they are used to define p-values. The distribution of these are 
then extracted for examples modeled on typical data analysis tasks, and comments on their 
usefulness for determining goodness-of-fit are given. 



1 Introduction 



Progress in science is the result of an interplay between model building and the testing of models 
with experimental data. In this paper, we discuss model evaluation and focus primarily on 
situations where a statement is desired on the validity of a model without explicit reference to 
other models. We introduce different discrepancy variables^for this purpose and define p- values 
based on these. We then study the usefulness of the p- values for passing judgments on models 
with a few simple examples reflecting commonly encountered analysis tasks. 

In the ideal case, it is possible to calculate the degree-of-belief in a model based on the 
data. This option is only available when a complete set of models and their prior probabilities 
can be defined. However, the conditions necessary for this ideal case are usually not met in 
practice. We nevertheless often want to make some statement concerning the validity of the 
model(s). We then are left with using probabilities^] of data outcomes assuming the model 
to try to make some judgments. These probabilities can be determined deductively since the 
model is assumed, and therefore frequencies of possible outcomes can be produced within the 
context of the model. These can then be used to produce frequency distributions of discrepancy 
variables, and p- values (one-sided probabilities for the discrepancy variables) can be calculated 
using the distributions and the observed values. The use of p-values has been widely discussed 
in the literature [|2j and many authors have commented that p- values are frequently misused in 
claiming support for models [3|. We give a Bayesian argumentation for the use of p- values to 
make judgments on model validity, and it is in this Bayesian sense that we will use p-values. 

We start with a review of model testing in a Bayesian approach when an exhaustive set 
of models is available and a coherent probability analysis is possible. We then move to situ- 
ations where this is not the case, and review some possible choices of discrepancy variables 
for Goodness-of-Fit (GoF) tests. Next, we define the p- values for the discrepancy variables and 
evaluate their usefulness for several example data sets. Note that we omit a discussion of Bayes' 
factors since these require the definition of at least two models for evaluation. 



2 Full set of models available 



2.1 Formulation 

Assume we have a complete set of models available for describing the data, such that we are sure 
the data can be described by one of the models available. The models can be used to calculate 
'direct probabilities'; i.e., relative frequencies of possible outcomes of the results were one to 
reproduce the experiment many times under identical conditions. The probability of a model, 
M, is denoted by P(M), with 

< P(M) < 1 , (1) 

'A discrepancy variable |T] is an extension of classical test statistics to allow possible dependence on unknown 
(nuisance) parameters. 

2 We use probability to also include probability density. 
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while the probability densities of the model parameters, 9, are typically continuous functions^] 
In the Bayesian approach, the quantities P(M) and P(9\M) are treated as probabilities, al- 
though they are not frequency distributions and are more accurately described as 'degrees-of- 
belief (DoB). This DoB is updated by comparing data with the predictions of the models. 
P(M = A) = 1 represents complete certainty that A is the model which describes the data, and 
P(A) = represents completely certainty that A is not the correct model. 

The procedure for updating our DoB using experimental data is 

P i+l (9, M\D) oc P(x = D\6, M)Pi(6, M) , (2) 

where the index on P represents a 'state-of-knowledge' . The posterior probability density func- 
tion, Pj+i, is usually written simply as P, and the prior is written as Pq. The posterior describes 
the state of knowledge after the experiment is analyzed. The quantity P(x = D\9, M) repre- 
sents a probability of getting the data D given the model and parameter values, and can usually 



be defined in a number of ways (see for example Section 4.3 ). 
Normalizing |2} and using 



P(D) = J2 / P{x = D\9, M)P (9, M)d9 



yields 



P(9,M\D) = P ^ M l P ^ M K (3) 

P(D) 

This is the classic equation due to Bayes and Laplace Q. 

Models can be compared and the DoB in a model can be obtained using 

P(M\D) = J P(9,M\D)d9 . (4) 

This evaluation requires the specification of a full set of models and a definition of the prior 
beliefs such that 

£)Po(M) = l . (5) 

M 

It is very sensitive to the definitions of the priors when the data are not very selective. 



2.2 Example with full set of models 

An example where this approach was used is given in pj, and is reviewed here. The analysis 
is to be performed on an observed energy spectrum, for which we can form a background- 
only hypothesis, or a background+signal hypothesis. An example is the search for neutrinoless 
double beta decay. 

3 Note that there is in principle no mathematical distinction between model and parameters. In practice, we 
distinguish them because models are fixed constructs for which we evaluate the degree-of-belief that the model is 
correct, whereas parameters can take on a range of values and the analysis is used to extract a degree-of-belief for 
a particular value. 
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In the following, H denotes the hypothesis that the observed spectrum is due to background 
only; the negation, interpreted here as the hypothesis that the signal process contributes to the 
spectrum^} is labeled H. The posterior probabilities for H and H can be calculated using 

n .„, . P(spectrum|PO ■ P (H) 

P(H spectrum = ' J 0K } (6) 

P (spectrum) 

and 

. , P(spectrum|PT) • P (H) 

P(H spectrum = ' ; 0K } . (7) 

P(spectrum) 

The probability P(spectrum) is 

P(spectrum) = P(spectrum|P) • P (H) + P( spectrum \H) ■ P (PT) . (8) 

The probabilities for a given spectrum can be calculated based on assumptions on the signal 
strength and background shape as described in pj. Evidence for a signal or a discovery is then 
decided based on the resulting value for P(H\ spectrum). 

In this analysis, it was assumed that the observed spectrum must come from either the 
background model or a combination of the background and double beta decay signal. The 
probability of each case is evaluated and conclusions are drawn from these probabilities. 



3 Incomplete set of models 

In most cases, we analyze data without having an exhaustive set of models available, but never- 
theless want to reach conclusions on how well the models account for the data. This information 
can be used, for example, to guide the search for new models. In the example given above, it is 
possible that there are unknown sources of background for which predictions are not possible 
before the experiment is performed. The quantities P (spectrum \H) and P ( spectrum \H) could 
be individually examined and, if both are on the small side of the expected distribution, doubts 
concerning the completeness of our set of models could arise. 



3.1 General approach to GoF tests 

For a given model, we can define one or more discrepancy variable(s) and calculate the expected 
frequency distribution of this discrepancy variable. If the discrepancy variable is well chosen, 
then the distribution for a 'good' model should look significantly different than for a 'bad' 
model. Finding the discrepancy variable in the region populated by incorrect models then gives 
us cause to think our model is not adequate. 

4 Since the shape of the background spectrum is assumed to be known the case of unknown background sources 
contributing to the measured spectrum is ignored. However, the overall level of background is allowed to vary. 
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3.2 Definition of a rvalue 



A p-value is the probability that, in a future experiment, the discrepancy variable will have a 
larger value (indicating greater deviation of the data from the model) than the value observed, 
assuming that the model is correct and all experimental effects are perfectly known. In other 
words, not only is the model the correct one to describe the physical situation, but correct dis- 
tribution functions are used to represent data fluctuations away from the 'true values'. We will 
focus on GoF tests for the underlying model, but it should be clear that incorrect formulations 
of the data fluctuations will bias the p-value distributions to lower (if the data fluctuations are 
underestimated) or higher (if the data fluctuations are overestimated) values. 

In general, any discrepancy variable which can be calculated for the observations can be 
used to define a p-value. We use R(x\9, M) and R(D\9, M) to denote discrepancy variables 
evaluated with a possible set of observations x for given model and parameter values, and for 
the observed data, x — D, respectively. To simplify the notation, we will occasionally drop 
the arguments on R and use R D to denote the value of the discrepancy variable found from the 
data set at hand. R can be interpreted as a random variable (e.g., possible x 2 values for a given 
model), whereas R D has a fixed value (e.g., the observed x 2 derived from the data set at hand). 

Assuming that smaller values of R imply better agreement between the data and model 
predictions, the definition of p (for continuous distributions of R) is written as: 



The quantity p is the 'tail-area' probability to have found a result with R(x) > R D , as- 
suming that the model M and the parameters 9 are valid. If the modeling is correct (including 
that of the data fluctuations), p will have a flat probability distribution between [0, 1]. For dis- 
crete distributions of R, the integral is replaced by a sum, the p-value distribution is no longer 
continuous, and the cumulative distribution for p will be step-like. 

If the existing data are used to modify the parameter values, the extracted p-value will be 
biased to higher values. The amount of bias will depend on many aspects, including the number 
of data points, the number of parameters, and the priors. We can remove the bias for the number 
of fitted parameters in x 2 fits by evaluating the probability of R = x 2 f° r N — n degrees-of- 
freedom, P(x 2 \N — n), where N is the number of data points and n is the number of parameters 
fitted @, when 

• the data fluctuations are Gaussian and independent of the parameters, 

• the function to be compared to the data depends linearly on the parameters, and 

• the parameters are chosen such that x 2 is at its global minimum. 

In general, the bias introduced by the number of fitted parameters becomes small if iV 3> n. 

p- values cannot be turned into probabilistic statements about the model being correct with- 
out priors, and statements of 'support' for a model directly from the p- value behave 'incoher- 
ently' [|3j. Furthermore, approximations used for the distributions of the discrepancy variables, 




(9) 
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biases introduced when model parameters are fitted and difficulties in extracting reliable infor- 



mation from numerical algorithms used to evaluate the discrepancy variable (see section 4.1.2) 
further complicate their use. p-values should therefore be handled with care. Nevertheless, we 
discuss the use of p- values to make judgments about the models at hand. The judgment will be 
based on a sequence of considerations of the type: 

• the p- value distribution for a good model is expected to be (reasonably) flat between [0, 1] ; 

• the p-values for bad models usually have sharply falling distributions starting at p = 0; 

• small p-values are worrisome; if we know that other models can be reasonably constructed 
which would have higher p- values, then a small p-value for the model under consideration 
indicates that we may have picked a poor model; 

• if the p- value is not too small, then our model is adequate to describe the existing data. 



3.3 Bayesian argumentation 

We contend that the use of p-values for evaluation of models as just described is essentially 
Bayesian in character. Following the arguments given above, assume that the p- value probabil- 
ity density for a good model, M , is flat, 

P(p\M Q ) = 1 , 

and that for poor models, Mj (i = 1 . . . k), can be represented by 

P{p\Mi) « Ci e- C > p 

where c, 3> 1 so that the distribution is strongly peaked at and approximately normalized to 
1. The DoB assigned to model M after finding a p- value p is then 

P(MM = pwwm 

If we take all models to have similar prior DoBs, then 

P(p\M ) 



P(M \p) 



In the limit p — > 0, we have 



while for c(p 3> 1 Vz > 



P (Mo\p) « — ~ k « 1 

1 + Ei=i 



P(M \p) « 1 



Although this formulation in principle allows for a ranking of models, the vague nature 
of this procedure indicates that any model which can be constructed to yield a reasonable p- 
value should be retained. A further consideration is that the correct distributions for the data 
fluctuations are often not known (due to the vague nature of systematic uncertainties) and best 
guesses are used. This will generally also lead to non-flat p- value distributions for good models. 

Scientific prejudices (Occam's razor, elegance or esthetics, etc.) will influence the decision 
and act as a guide in selecting the 'best' model in cases where several good models are available. 
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3.4 Discrepancy variables considered in this paper 



3.4.1 x 2 test f° r data with Gaussian uncertainties 

For uncorrelated data assumed to follow Gaussian probability distributions relative to the model 
predictions, the x 2 value is a natural discrepancy variable for a GoF test: 



where the iV data points are given by {(xi, yi)}, and the prediction of the model for yi is 
f(xi\9, M). The modeling of the data fluctuations uses fixed standard deviations <t,. 

If the parameters of the model are fitted to the data by minimizing x 2 an d there are n pa- 
rameters we replace 9 in the formula above with 9*. The x 2 probability distribution is evaluated 
for iV — n 'degrees-of-freedom'. In the special case where / is linear in the parameters, this 
procedure again yields a flat p- value distribution between [0, 1]. 

3.4.2 Runs test 

The standard x 2 test does not take into account clustering of data below or above expectations. 
To detect clusters the ordered set of N observations {(xi,yi)} is partitioned into subsets con- 
taining the success and failure runs (defined as sequences of consecutive yi above or below the 
expectation from the model, f(xi\9, M), respectively). Several discrepancy variables based on 
success runs can be found in the literature [7| but these do not take into account the size of the 
deviation, y { — f(xi\9, M). Recently, a discrepancy variable for runs incorporating this extra 
information was proposed for ordered data with Gaussian fluctuations (SJ. 

Let Aj denote the subset of the observations of the j th success run. The weight of the j th 
success run is then taken to be 



i=j'i 1 

where the sum over i covers the {(xj, yi)} E Aj and Nj is the length of the run. The discrepancy 
variable is the largest weight of any run 

R sr = max x r 2 un .• • 
o 

The explicit distribution of R sr , used to define the p-value, p = P(R sr > R®\N), is given 
in ^ for the case when (6, M) are fully specified (no fitting). A similar discrepancy variable 
can be defined for failure measurements, Rf r . 

To illustrate the definition we present a simple example. Suppose N = 5 observations at x 
positions (1,2,3,4,5) with standardized residuals (yi — f(xi\6, M))/ai given by 
(0.3,-0.1,-0.8,0.4,0.2). Then there are two success runs A\ = {(1,0.3)}, 
A 2 = {(4, 0.4), (5, 0.2)} and we find R sr = 0.16 + 0.04 = 0.2 due to the second run. Similarly, 
for the single failure run, Rf r = 0.65. 




(ID 



^^C 1 (yi-f(xi\9,M)^ 



2 
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3.4.3 x 2 tests f° r Poisson distributed data 



In the case of binned, Poisson distributed data, the quantity 

T~f At 

t=l 

can be used as a discrepancy variable where N b is the number of bins, the m, are the numbers 
of measured events and At = Aj(0, M) are the model expectations for event bin i; for an ex- 
pectation At the expected variance is Aj. This Pearson x 2 statistic [9] was originally proposed 
for multinomial data but has found wide use in analyzing Poisson distributed data. Rather than 
using the probability distribution of this discrepancy variable directly, P(x 2 \ N b ) is often used as 
an approximation for P(i£p)R The distribution of Rp has been shown to asymptotically reach 
a x 2 distribution for multinomially distributed data, giving some justification for this procedure. 
Practically, this is the case for data with a large number of entries rrii in all bins. When parame- 
ters are first estimated from a fit to the data, the parameter values which minimize Rp are used 
to calculate the Aj, and the p- value is evaluated using P(x 2 |iV fc — n). 

It is often seen that the expected weight, At, is replaced with the observed weight, rrii, in the 
denominator (Neyman x 2 1 10 1). The discrepancy variable is then 



rrii 

t=i 1 

Again, rather than using the probability distribution of this discrepancy variable directly, it 
is assumed that R^ has a distribution which approximates a x 2 distribution with Nb degrees 
of freedom and P(x 2 \Nb) is used as an approximation for P(Rn). In cases where rrii = 0, 
practitioners of this approach set rrii = 1 to avoid divergence. Sometimes bins with rrii = are 
ignored, which can lead to very misleading results since finding rrii = is valuable information. 
When parameters are first estimated from a fit to the data, the parameter values which minimize 
Rn are used, and the p- value is evaluated using P(x 2 \N b — n). 

Note that in both of these cases, we do not expect flat p-value distributions since only ap- 
proximations are used for P(R P / N ). The deviations from flatness are expected to be greatest 
when small event numbers are present in the data sets. 



3.4.4 Likelihood ratio test for Poisson distributed data 

Another option for a Poisson model is based on the (log of) the likelihood ratio p"TJ (sometimes 
referred to as the Cash statistic [ fl2| ) 



R c = 2\og 



P(x\\j = rrij) 
P(x\Xi = Xi(6)) 



i=l 



X l -m i + rrii log 



rrii 
A,- 



(14) 



5 The use of this approximation probably dates back to a time when complicated numerical calculations were 
not possible. 
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In bins where = 0, the last term is set to 0. Again, rather than using the probability distri- 
bution of this discrepancy variable directly, since Rc has a distribution which approximates a 
X 2 distribution with iV& degrees of freedom for large Aj, P(x 2 \Nb) is used as an approximation 
for P(Rc). The validity of this approximation is critically discussed in p3| . When parameters 
are first estimated from a fit to the data, the parameter values which maximize the likelihood 
are used, and the p- value is evaluated using P(x 2 \^b — n). Note that the distribution of R c 
asymptotically converges faster to the x 2 distribution than the distribution of Rp (see p"4][). 

3.4.5 Probability of the data test 

Any probability of the data can be chosen as the discrepancy variable: 

R L = P(x\6,M) . 

In this case, larger values of Rl imply better agreement with the data. The probability P(x\6, M) 
can be used to extract the probability for R L as 

P{R L )dR L = [ P(x\B, M)dx . 

J R L <P(x\8,M)<R L +dR L 

An example of how this is done numerically for Poisson distributed data and using 

P(m\e,M) = T\- — ^~ 

i 

is given in the Appendix. Once we have P(Rl\9, M) we can then evaluate 

P= f P{R L )dR L 

JRl<R1 

where R% is the value observed with the data set at hand. 

In a model with Gaussian uncertainties where we use a product of Gaussian densities, 

N 

P(x\9,M) oc \[P{ Xi \^ai) 

i 

where x,; ~ jV(//j, <7j), then taking R L = P(x\0, M) is equivalent to the usual x 2 test - 

If the model parameters are first fitted using the data, we propose the following correction 
for the number of fitted parameters: 

• calculate the p- value taking R% = P(x = D\9*, M), but assuming a simple hypothesis 
(no fitted parameters); 

• calculate the x 2 value which corresponds to this p-value using the inverse x 2 distribution 
corresponding to iV degrees of freedom; 
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• recalculate the p-value using the x 2 value and N — n degrees of freedom. 



This procedure is valid for the case where we have Gaussian uncertainties, but is ad-hoc for 
other cases. We test its usefulness below. 

Care must be taken in using R L as a discrepancy variable, particularly when it is written as 
the product of individual probability densities. The overall shape of the distribution is poten- 
tially not tested, and large p- values can be produced with incorrect model choices. An example 



of this is given below in Section 4.3 



3.4.6 Partial/Prior/Posterior-predictive p-value 



Rather than using the parameters at the mode of the posterior, it is also possible to define a 



p- value by averaging over the parameter values according to a probability distribution [ 15 1 



For the posterior-predictive case jT6j, assuming we are using a probability of the data as 
discrepancy variable, 



P 



Rl<R? 



P{R L \6,M)dR 1 



P(9\D,M)d9 



(15) 



While the partial posterior-predictive p- value [15] has the desirable property of a flat distri- 
bution on [0, 1], at least in the large sample limit as iV — > oo, it is not known in general how 
to compute it for realistic problems. Furthermore, the numerical effort required to evaluate the 



double integral in ( [15] ) quickly becomes prohibitive. Therefore, these p- value definitions are not 
considered here despite their appeal from a Bayesian perspective. 



3.4.7 Johnson test 

Johnson {Tt\ proposed a modification of the x 2 definition to take into account the posterior 
probability density for the parameters of a model. Rather than evaluating the probability of 
the data at fixed values of the parameters, the parameters are given values according to their 
probability density after evaluating the data. Johnson's statistic, Rj, is expected to behave 
asymptotically as a x 2 distribution with N b — 1 degrees of freedom, where N b is a number 
of bins to be defined, regardless of the number of parameters. Imagine the data is given by a 
vector of values x and these values are expected to deviate from the model prediction according 
to individual probabilities P(xj\9, M). The Johnson prescription is to define N b bins, where 
bin i contains a probability P;. Intervals Axi j are defined for each data point j via 

Pi(9) = j P(x j \9,M)dx j . 

J Axi j 

The intervals Ax^j cover the full range of possible values for each Xj and are ordered so that 
they include increasing values of x. The definition of the intervals depends on the values of 
the parameters 9 and vary for each data point j. The parameter values are to be sampled from 
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the full posterior probability P(9\D, M). For each 6, a discrepancy variable Rj is calculated 
according to 
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Rj = y^ =5 — — (i6) 

tr nps 

where m, is the actual number of data points falling within the intervals Ascy and N is the total 
number of data points in the data set. In [17] it is shown that asymptotically Rj is distributed 
as a x 2 variable with Nb — 1 degrees of freedom, regardless of the number of fitted parameters 
n. Hence the p-value is calculated using P(x 2 |iV b — 1). 

For data where the x follow continuous probability densities the bins are usually chosen to 
have equal probabilities. The number of bins to be chosen is given by a rule of thumb due to 
Mann/Wald JT8| | with a modification for small number of bins such that there are at least three: 

N b = [ e °- 4 'W + 2j . 

This means by default we have N b (N = 25) = 5, N b (N = 100) = 8, N b (N = 1, 000) = 17. 

In case the values of x follow a discrete distribution it is usually not possible to have equal 
probabilities for all P, ; . In this case, a randomization procedure is used to allocate data points to 
bins. 



4 Test of p- value definitions 

In the following, we test the usefulness of the different p- values given above by looking at their 
distributions for specific examples motivated from common situations faced in experimental 
physics. We first consider a data set which consists of a background known to be smoothly 
rising and, in addition to the background, a possible signal. This could correspond for example 
to an enhancement in a mass spectrum from the presence of a new resonance. The width of the 
resonance is not known, so that a wide range of widths must be allowed for. Also, the shape 
of the background is not well known. We do not have an exhaustive set of models to compare 
and want to look at GoF's for models individually to make decisions. In this example, we will 
first assume that distributions of the data relative to expectations are modeled with Gaussian 
distributions. We then consider the same problem with small event numbers, so that Poisson 
statistics are appropriate, and again test our different p-value definitions. These examples were 
also discussed in p9| . Finally, we consider the case of testing an exponential decay law on a 
sample of measured decay times. 
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4.1 Data with Gaussian uncertainties 



4.1.1 Definition of the data 

A typical data set is shown in Fig. |T| It is generated from the function 

f{xi) = A + Bx l + Cx 2 i +^^e~ {JE ^ , (17) 

o\J2tx 

with parameter values (A = 0, B = 0.5, C = 0.02, D = 15, a = 0.5, /i = 5.0). The yi are 
generated from /(xj) as 

Vi = f(x t ) + Zi-A 
where z% is sampled according to Af(0, 1). 

>, 35 
30 
25 
20 
15 
10 

5 


-5 

2 4 6 8 10 12 14 16 18 20 

x 

Figure 1: Example data set for the case N = 25 with Gaussian fluctuations. The fits of the four 
models are superposed on the data. 

The x domain is [0, 20] and two cases were considered: N = 25 data points evenly sampled 
in the range, and iV = 100 data points evenly sampled in the range. The experimental resolution 
(Gaussian with width 4) is assumed known and correct. Ensembles consisting of 10,000 data 
sets with N = 25 or N = 100 data points each were generated and four different models were 
fitted to the data. Table [T] summarizes the parameters available in each model and the range over 
which they are allowed to vary. In all models, fiat priors were assumed for all parameters for 
ease of comparison between results. 

The models were fitted one at a time. Two different fitting approaches were used: 



• Data 

Model I 

--- Model II 




1) The fitting was performed using the gradient-based fitter MIGRAD from the MINUIT 
package [20| accessed from within the Bayesian Analysis Toolkit (BAT) [19|. The start- 
ing values for the parameters were chosen at the center of the allowed ranges given in 
Table EJ 
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Table 1 : Summary of the models fitted to the data, along with the ranges allowed for the param- 
eters. 
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2) The Markov Chain Monte Carlo (MCMC) implemented in BAT was run with its default 
settings first, and then MIGRAD was used to find the mode of the posterior using the 
parameters at the mode found by the MCMC as starting point. 

Since it is known that the distributions of the data are Gaussian and flat priors were used for 
the parameters, the maximum of the posterior probability corresponds to the minimum of % 2 . 
The p- values were therefore extracted for each model fit using the x 2 probability distribution as 
discussed in Sec. 13.4. ll 





Figure 2: p- value distributions based on R G . The parameters of the models discussed in Table[T] 
are fitted to iV = 25 data points and allowing parameter values in the small range. The p- values 
correspond to N — n degrees of freedom, where n is the number of fitted parameters. 



4.1.2 Comparison of p- values 

The p- value distributions for the four models are given in Fig. [2] for the case iV = 25 and 
using the small fit range from Table [T] Two different histograms are shown for each model, 
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corresponding to the two fitting techniques described above. The distributions for models I and 
II are peaked at small values of p for MIGRAD only and for MCMC+MIGRAD, and the models 
would be disfavored in most experiments. For models III and IV, there is a significant difference 
in the p-value distribution found from fitting using only MIGRAD, or MCMC+MIGRAD, with 
the p-value distribution in the latter case being much flatter. An investigation into the source of 
this behavior showed that the p- value distribution from MIGRAD is very sensitive to the range 
allowed for the parameters. This can be seen in Fig. [3] where the same fits were performed but 
now using the large ranges for the parameters (see table). In the case where larger parameter 
ranges are allowed, MIGRAD will converge to parameter values which are not at the global 
mode more often, and the choice of the starting point and the (initial) step size for the fit are 
crucial. Even in this rather simple fitting problem, it is seen that the use of the MCMC can make 
a significant difference in the fitting results. 




Figure 3: p- value distributions based on R G . The parameters of the models discussed in Table [T] 
are fitted to N = 25 data points and allowing parameter values in the large range. The p- values 
correspond to N — n degrees of freedom, where n is the number of fitted parameters. 

The results from the MCMC+MIGRAD fits also depended on the fit range, although to a 
lesser extent. Figure [4] shows the p- value distributions from \ 2 fits to N = 25 data points for 
Model IV for the two parameter ranges. There are small but nevertheless significant differences 
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in the distribution, indicating that the parameter range is also an important factor in the p- value 
distribution. Note that we are not expecting fiat p-value distributions since the correction for 
the number of fitted parameters is only valid for models linearly dependent on the parameters. 



This is not the case here since we have a Gaussian term in the model (see 17 ) 
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Figure 4: p-value distributions based on Rq for model IV. The parameters of the model are fitted 
to iV = 25 data points and allowing parameter values in the two ranges range. The p-values 
correspond to iV — n degrees of freedom, where n is the number of fitted parameters. 

Focusing on the results from the MCMC+MIGRAD fits, models III and IV give flat p- 
value distributions and would have a large DoB in a majority of experiments. In general, these 
distributions satisfy our expectations and these p- values can be used to update our DoBs in the 
models as described in Section l33l 

The p-value distributions for the four models and using the Johnson discrepancy variable 
are given in Fig. [5] Note that in the results shown for this descrepancy variable, we have used 
9* rather than sampling 9 according to the posterior. We verified that this did not significantly 
change the p- value distribution of this discrepancy variable. The distributions show spiky be- 
havior, which becomes much smoother as the number of data points increases. However, there 
is still very little discriminating power between the models using this discrepancy variable. This 
was generally the case for other examples considered, and we therefore do not show any further 
results from the Johnson discrepancy variable in this paper. 

For all further results presented in this paper, we have performed MIGRAD only fits but 
with starting parameter values located near the global mode (whose location can be estimated 
since the underlying model is known). Only the smaller fit ranges were used. In this way, 
we approximate the results which would be achieved with optimal fitting. This allows us to 
generate and analyze larger data sets, and thus have more sensitivity to the shape of the p- value 
distributions. 

The p-value distributions for the four models and using the 'runs' discrepancy variable are 
given in Fig. [6} In this figure, there are two p-value distributions in each plot since we consider 
both cases where we have runs of data points below the expectations, and runs of data points 
above the expectations. For the correct model, the joint distribution of p(R sr ) and p(Rf r ) should 
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Figure 5: p-value distributions based on Rj. The parameters of the models discussed in Table[T] 
are fitted to N = 25 data points and allowing parameter values in the small range. The p- values 
correspond to N b — 1 degrees of freedom, where N b = 5 is the number of bins. 
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be symmetric. The p-values should generally not be too small. The distribution of p(R sr ) vs. 
p(Rfr) is shown in Fig. [7] for the example under consideration. The biasing of the p- values 
resulting from the fitting of parameters is apparent. In using Rf r and R sr , rather large p- values 
should therefore be expected for valid models. 



Model I 



— Success 
Failure 





Figure 6: p-value distributions based on R sr and Rf r . The parameters of the models discussed 
in Table[TJare fitted to N = 25 data points and allowing parameter values in the small range. 

The results for the ensembles with N = 100 data points are shown in Figs. [8] and [9] The 
models I and II now usually result in very small p-values using both the standard \ 2 an d runs 
discrepancy variables. 

For the \ 2 fi ts > me p- value distribution for model IV is again rather fiat, as expected, but it 
would generally be difficult to conclude that model III is not adequate. Although this p- value 
distribution is falling, as opposed to the case where we only had N = 25 data points, there is 
still a large probability to get a sizable p- value. 

The runs statistic has a sharply peaked distribution near p = (for success runs) for models 
1,11, and should therefore have better discriminating power than the standard x 2 test. In other 
words, it should more often lead to the desired result that the incorrect models are discarded. 
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Figure 7: Joint distribution of the p- values for success and failure runs. The results are for 
N = 25 data points. Marker sizes in each bin are chosen relative to the bin with the highest 
probability of the individual model. Bins with probability less than 3.5- 10~ 3 have been excluded 
from the plot for the purpose of clarity. 



For model III, the success runs variable behaves similarly to the x 2 test > but the failure runs 
variable is rather fiat. For the correct underlying model, model IV, we see that both success and 
failure runs variables have similar p- value distributions. 
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Figure 8: p- value distributions based on Rq. The parameters of the models discussed in Table [T] 
are fitted to iV = 100 data points and allowing parameter values in the small range. The p- values 
correspond to N — n degrees of freedom, where n is the number of fitted parameters. 
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Figure 9: p-value distributions based on R sr and Rf r . The parameters of the models discussed 
in Table [T] are fitted to N = 100 data points and allowing parameter values in the small range. 
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4.2 Example with Poisson uncertainties 



4.2.1 Definition of the data 



We again use the function f(x) (see ([17])), but now generate data sets with fluctuations from 
a Poisson distribution. The function is used to give the expected number of events in a bin 
covering an x range, and the observed number of events follows a Poisson distribution with this 
mean. For N b = 25 bins, e.g., the third bin is centered at x% = 2.0 and extends over [1.6, 2.4]. 
The expected number of events in this bin is defined as A 3 = J 2 '^ ^f(x)dx . 



4.2.2 Comparison of p- value distributions 



For the Poisson case we consider the four discrepancy variables described in sections 3.4.3 



3.4.5 to compare the p- value distributions for the four models (I-IV). We also show the p- value 
distribution for the 'true' model for comparison, where the p-value is calculated by evaluating 

R = R(x\6 truc ,M). 

Different fits were performed for each of the different discrepancy variable definitions. For 
the x 2 discrepancy variables, x 2 minimization was performed using either the expected or ob- 
served number of events as weight. For the likelihood ratio test, a maximum likelihood fit was 
performed. The likelihood was defined as 

N b p -Xi\mi 

P(m\e,M) = H— J- (18) 

Hl>i- 

1=1 

where m,j is the observed number of events in a bin and Aj = Xi(9). Since we use flat priors for 
the parameters, the same results were used for the case where the discrepancy variable is the 



probability of the data. A typical likelihood fit result is shown in Fig. 10 together with the data 



while the p- value distributions are given in Figs. 11 12 



For all definitions of p-values considered here, models 1,11 generally lead to low DoBs, while 
models III and IV generally have high DoBs. However, there are differences in the behavior of 
the p-values which we now discuss in more detail. 



The lower left panel in Figs. 11 12 show the p- value distribution taking the true model 



and the true parameters, and can be used as a gauge of the biases introduced by the p-value 



definition. There is a bias for all discrepancy variables shown in Fig. 1 1 because approximations 
are used for the probability distribution of the discrepancy variable (using P(x 2 |iV) rather than 
P(Rp\N b ) or P(R N \N b ). This bias is usually small, except in the case where the Neyman x 2 
discrepancy variable is used. In this case, the probability of a very small p-value is much too 
high even with the true parameters. The p-value distribution using the probability of the data, 
on the other hand, is flat when the true parameters are used as seen in Fig. [12} so that this choice 
would be optimal in cases where no parameters are fit. 
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Figure 10: Example fits to one data set for the four models defined in Table [T] for N b = 25 
and the small parameter range. The lines shows the fit functions evaluated using the best-fit 
parameter values. 



4.2.3 Pearson x 2 and Neyman x 2 



Comparison of the behavior of the Pearson x 2 to the Neyman x 2 , Fig. 1 1 clearly shows that 
using the expected number of events as weight is superior. The spike at in the p-value distri- 
bution when using the observed number of events indicates that this quantity does not behave 
as expected for a x 2 distribution when dealing with small numbers of events, and will lead to 
undesirable conclusions more often than anticipated. The behavior of the Pearson x 2 using the 
expected number of events for the different models is quite satisfactory, and this quantity makes 
a good discrepancy variable even in this example with small numbers of events in the bins. Both 
of these p- value distributions become flat in the case of large number of events in all bins. 



4.2.4 Likelihood Ratio 



While models 1,11 are strongly peaked at small p- values, for this definition of a p- value models 
IIIJV also have a falling p-value distribution. This is somewhat worrisome. In assigning a 
DoB to models based on this p- value, this bias towards smaller p- values should be considered, 
otherwise good models will be assigned too low a DoB. The behavior of the p- value extracted 
from Pearson's x 2 is preferable to the likelihood ratio, as it will lead to value judgments more 
in line with expectations. 



4.2.5 Probability of the Data 



Using this definition, the p-value distribution is flat for the true model, since this is the only 
case where the correct probability distribution of the discrepancy variable is employed. For the 
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Figure 11: p-value distributions based on Rp, and Rc- The parameters of the models 
discussed in Table [T] are fitted to N b = 25 bins and allowing parameter values in the small 
range. The p-values correspond to N — n degrees of freedom, where n is the number of fitted 
parameters. 
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fitted models, two p-value distributions are shown - one uncorrected for the number of fitted 
parameters, and one corrected for the fitted parameters as described in Section 3.4.5 Using the 
uncorrected p-values, both models III and IV show p-value distributions peaking at p = 1, so 
that both models would typically be kept. As expected, the correction for the number of fitted 
parameters pushes p to lower values, and produces a nearly flat distribution for model IV. This 
ad-hoc correction works well for this example. In general, this p-value definition has similar 
properties to the Pearson x 2 statistic, but with the advantage of a flat p-value distribution when 
the true model was used and no parameters were fit. 



4.3 Exponential Decay 

When dealing with event samples with continuous probability distributions for the measurement 
variables, it is common practice when determining the parameters of a model to use a product 
of event probabilities (unbinned likelihood): 

N 

P(x\9,M) = Y[P(xi\6,M) . 

i=i 

If the model chosen is correct, then this definition for the probability of the data (or likelihood) 
can be used successfully to determine appropriate ranges for the parameters of the model. How- 
ever, P(x\9,M) defined in this way has no sensitivity to the overall shape of the distribution 
and can lead to unexpected results if this quantity is used in a GoF test of the model in question. 
We use a common example, exponential decay, to illustrate this point (see also discussions in 
(6) and (2TJ). 



Our model is that the data follows an exponential decay law. We measure a set of event 
times, t, and analyze these data to extract the lifetime parameter r. We define two different 
probabilities of the data D = {ti} (i — 1 . . . N) 



I Unbinned likelihood 



N 

p(zv)=n^" /r . 

8=1 



II Binned likelihood 



N b \7Tli fti+At AT 

i=1 Jti T 



e 



In the first case, the probability density is a product of the densities for the individual events, 
while in the second case the events are counted in time intervals and Poisson probabilities are 
calculated in each bin. The expected number of events is normalized to the total number of 
observed events. We consider time intervals with a width At = 1 unit and time measurements 
ranging from t = to t = 20. The overall probability is the product of the bin probabilities. 
For each of these two cases, we consider the p- value determined from the distribution of R = 
P(D\t). 

In order to make the point about the importance of the choice for the discrepancy variable 
for GoF tests, we generated data which do not follow an exponential distribution. The data is 



generated according to a linearly rising function, and a typical data set is shown in Fig. 13 
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Figure 12: p-value distributions based on R L . The parameters of the models discussed in Table[T] 
are fitted to N b = 25 bins and allowing parameter values in the small range. The p-values 
correspond to N b and N b — n degrees of freedom, where n is the number of fitted parameters. 
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Figure 13: Typical data set used for the fitting of the exponential model. The individual event 
times are shown, as well as the binned contents as defined in the text. The best fit exponential 
from the unbinned likelihood is also shown on the plot, normalized to the total number of fitted 
events. 



4.3.1 Product of exponentials 



If the data are fitted using a flat prior probability for the lifetime parameter, then we can solve 
for the mode of the posterior probability of r analytically, and get the well-known result 



Defining £ = U, so r* = £/N, we can also solve analytically for the p-value: 



V 



[ dt[ [ dt' 2 ...(r*y N e-^ /T * 



and the result is 

p—l — P (N, N) 
with the regularized incomplete Gamma function 

Pis X)- 1{S > X) - fo tS ~ le ~ tdt 
l ' ' V{s) fft'-ie-W 

Surprisingly, as N increases p is approximately constant with a value p ps 0.5. Regardless of the 
actual data, p is never small and depends only on N. The best fit exponential is compared to the 



data in Fig. 13 and yields a large p- value although the data is not exponentially distributed. This 
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p-value definition is clearly useless, and the unbinned likelihood is seen to not be appropriate 
as a GoF statistic. 

The definition of x 2 m me Gaussian data uncertainties case also results from a product of 
probability densities, so the question as to what is different in this case needs clarification. The 
point is that in the case of fitting a curve to a data set with Gaussian uncertainties, the data are 
in the form of pairs, {(xj, Ui)}, where the measured value of y is assigned to a particular value 
of x, and the discrepancy between y and f(x) is what is tested. In other words, the value of 
the function is tested at different x so the shape is important. In the case considered here, the 
data are in the form {xi}, and there is no measurement of the value of the function f(x) at a 
particular x. The orderings of the Xi is irrelevant, and there is no sensitivity to the shape of the 
function. 



4.3.2 Product of Poisson probabilities 

In this case, the p- value for the model is determined from the product of Poisson probabilities 
using event counting in bins as described above. Since the expected number of events now 
includes an integration of the probability density and cover a wide range of expectations, it is 
sensitive to the distribution of the events and gives a valuable p-value for GoF. The p-value 
using R P for the data shown in Fig. [13] is p = within the precision of the calculation. The 
exponential model assumption is now easily ruled out. 

In comparison to the unbinned likelihood case, the data are now in the form m, where the 
rrii now refer to the number of events in a well-defined bin i which defines an x range. In other 
words, we have now a measurement of the height of the function f(x) at particular values of x, 
and are therefore sensitive to the shape of the function. 

As should be clear from this example, the choice of discrepancy variable can be very impor- 
tant in GoF decisions: Maximum Likelihood estimation with unbinned data will always give an 
optimal parameter estimate in terms of bias and variance, but it will give no information about 
the correctness of the model. 



5 Discussion 

We have investigated a number of possible discrepancy variables and p-value definitions for 
Goodness-of-Fit tests, with the understanding that these p-values are to be used to make judg- 
ments on the acceptability of models. The sense in which the p- values are used follows Bayesian 



logic as described in Section 3.3 For this purpose, the p- value distributions should be reason- 
ably uniform for models which are considered good and they should peak at small values for 
poor models. Using these requirements to guide us, we classify our results in Table [2] for the 
example data sets and models described in the text. Based on our experience with these exam- 
ples, we also summarize our suggestions for the use of the different discrepancy variables and 
p- value definitions considered here. 

Most common p- values use approximations for the distribution of the discrepancy variable. 
This leads to non-flat distributions of the p- value even when no parameters are fitted, and these 
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deviations can be severe in some cases. We discussed the use of the probability of the data 
itself as a discrepancy variable, and showed that it is flat in the case of no fitted parameters 
and Poisson distributed data. This was due to the use of the correct probability distribution for 
the discrepancy variable. The algorithm described in the Appendix can be used to generate the 
correct p- value distribution for any discrepancy variable for Poisson distributed data. 

For composite models, where parameters of the model are fit to the data before a p- value is 
calculated, we find that p- value distributions can depend strongly on the technical approach used 
to fit the data. Using gradient-based approaches to find minima or maxima requires considerable 
attention from the user and generally fine tuning of fit ranges and starting values. The fine- 
tuning will certainly affect the p-value distributions, making their use more difficult. Setting 
range limits effectively means defining a prior, and impacts p-value distributions. First fitting 
the data with a Markov Chain Monte Carlo before using MIGRAD gave considerably better 
results, in the sense that the p-value distributions extracted in this way followed expectations. 
We therefore recommend using a MCMC to map out the parameter space as a start to the fitting 
procedure in situations where giving good starting values for a gradient-based fit is difficult. 
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A Fast £>-value evaluation in MCMC for Poisson distributed 
data 



The peak probability for a Poisson distribution 

P{m\X) -- 



ml 



occurs for m = [X\. If the probability distribution of a data set is modeled as a product of 
Poisson terms, then the highest probability is given for {wij} = { [A^J }. We use this to define 
the starting point for a Markov Chain, and move the bin contents up or down (chosen randomly) 
at each iteration. For each attempted change in the bin content, we apply the usual Metropolis 
test J22J . The probability P(m\X) is easily updated at each change. E.g., if the result in bin i 
increases from m, to m'^ then the probability changes by 



JTTli. .m' i —rrii 



p p^-X 

m-! 

A large number of experiments can be quickly simulated and the p- value extracted. 
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